Towards a More Useful Measure of Dispersion in Corpus Data
Abstract
The most frequently used statistic in corpus linguistics is the frequency of occurrence of some linguistic variable or the frequency of co-occurrence of two or more linguistic variables. However, it is widely known that frequencies of (co-)occurrence can be quite misleading. For example, Leech, Rayson, and Wilson (2001) show that while the words HIV, keeper, and lively are about equally frequent in the British National Corpus (16 per million words), HIV is much more specialized, such that it occurs in a much smaller number of files than keeper and lively. Similarly, Gries (2006) shows how investigating the association of verbs to the imperative in the ICE-GB can be severely distorted by words that are highly frequent in just one out of 500 files. In order to handle such problems, several scholars have suggested a variety of dispersion measures; these include the range, the standard deviation or the variation coefficient, Juilland’s D, Carroll’s D2, Rosengren’s S, the usage coefficient, and inverse document frequency (cf. Oakes 1998 or Rayson 2003 for overviews). However, these coefficients suffer from the problem that they require the corpus parts for which a dispersion measure is computed to be equally large (Oakes 1998: 191), which is usually not the case. Chi-square, by contrast, does not rely on this assumption, but it can take on widely varying values when expected frequencies become very small. In this paper, I will propose for discussion a very simple alternative measure, which (i) quantifies dispersion just like the measures above, (ii) does not rely on the unwarranted assumption of equally sized corpus parts, and (iii) is not affected by small expected frequencies. I will exemplify this measure using both words and word-construction pairings from different frequency bands and with different degrees of dispersion in the British National Corpus Sampler.
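The abstract does not spell out the proposed measure, so the following is only a minimal sketch of the general idea it describes: comparing where an item actually occurs with where it would be expected to occur given the sizes of the corpus parts. The formula used here (half the summed absolute differences between observed and expected proportions) is an assumption, modelled on the "deviation of proportions" idea named in the related work listed on this page. The function names and the toy counts are hypothetical.

```python
def dispersion_range(counts):
    """Range: the number of corpus parts in which the item occurs at least once."""
    return sum(1 for c in counts if c > 0)


def deviation_of_proportions(counts, part_sizes):
    """Assumed formula: 0.5 * sum(|observed share - expected share|).

    counts[i]     -- occurrences of the item in corpus part i
    part_sizes[i] -- size of corpus part i (e.g. in words); parts may differ in size
    Values near 0 mean the item is spread in proportion to the part sizes;
    values near 1 mean it is concentrated in a small share of the corpus.
    """
    if len(counts) != len(part_sizes):
        raise ValueError("counts and part_sizes must have the same length")
    total_count = sum(counts)
    total_size = sum(part_sizes)
    if total_count == 0 or total_size == 0:
        raise ValueError("the item and the corpus must both be non-empty")
    return 0.5 * sum(
        abs(c / total_count - s / total_size)
        for c, s in zip(counts, part_sizes)
    )


# Toy corpus with three parts of unequal size (hypothetical numbers).
part_sizes = [1000, 3000, 6000]
evenly_spread = [10, 30, 60]   # occurrences mirror the part sizes
clumped = [100, 0, 0]          # same total frequency, all in the smallest part

print(dispersion_range(evenly_spread), deviation_of_proportions(evenly_spread, part_sizes))
print(dispersion_range(clumped), deviation_of_proportions(clumped, part_sizes))
```

Both toy items have the same overall frequency (100), yet the sketch separates them clearly, and it never requires the three parts to be the same size.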
Similar resources
Measures of dispersion for corpus data: an overview, a suggestion, and a research program II
In order to adjust observed frequencies of occurrence, previous studies have suggested a variety of measures of dispersion and adjusted frequencies. In part I of this article, I first summarily reviewed many of these measures as well as a variety of their shortcomings and then suggested an alternative measure, DP, for deviation of proportions, which I argued to be conceptually simpler, but at t...
Moving dispersion method for statistical anomaly detection in intrusion detection systems
A unified method for statistical anomaly detection in intrusion detection systems is theoretically introduced. It is based on estimating a dispersion measure of numerical or symbolic data on successive moving windows in time and finding the times when a relative change of the dispersion measure is significant. Appropriate dispersion measures, relative differences, moving windows, as well as tec...
Application of Outlier Robust Nonlinear Mixed Effect Estimation in Examining the Effect of Phenylephrine in Rat Corpus Cavernosum
Ignoring two main characteristics of concentration-response data, correlation between observations and the presence of outliers, may lead to misleading results. Therefore, a special method should be considered. In this paper, to examine the effect of phenylephrine in rat corpus cavernosum, outlier-robust nonlinear mixed estimation is used. In this study, eight different doses of phenylephrin...
When Frequency Data Meet Dispersion Data in the Extraction of Multi-word Units from a Corpus: A Study of Trigrams in Chinese
One of the main approaches to extracting multi-word units is the frequency threshold approach, but the way this approach considers dispersion data still leaves a lot to be desired. This study adopts Gries’s (2008) dispersion measure to extract trigrams from a Chinese corpus, and the results are compared with those of the frequency threshold approach. It is found that the overlap between the two ap...
A Systematic Method to Analyze Transport Networks: Considering Traffic Shifts
Current network modeling practices usually assess network performance at specified time intervals, i.e. at every 5- or 10-year time horizon. Furthermore, they are usually based on partially predictable data, which are generated through various stochastic procedures. In this research, a new quantitative methodology which combines combinatorial optimization modeling and transportation...